CPSC 545/445 (Autumn 2003) - Class 18:  Gene Expression Analysis (2)
Module 6, Part 2

----

brief recap:

key idea behind microarry technology

--
(6.1 continued)

Serial Analysis of Gene Expression (SAGE) - Velculescu et al., 1995


Key ideas:

* A short sequence tag (10-14 bases) contains sufficient information 
	to uniquely identify a transcript (mRna, cDNA, etc.)
* Sequence tags can be linked together to form 
	concatemers = long molecules that can be cloned and sequenced
* Quantitation of the number of times a particular tag is observed 
	provides the expression level of the corresponding transcript 
[from http://www.sagenet.org/]


Basic steps of SAGE:

1. 	Trap RNAs with beads (using magnetic beads with attached poly-T oligos)
2. 	Convert the RNA into cDNA (using reverse transcriptase)
3. 	Make a cut in each cDNA so that there is a broken end sticking out
		(using restriction enzyme, recognition sequence
		is typically 4bp long)
4. 	Attach a "docking module" to this end; here a new enzyme can dock,
	 	reach down the molecule, and cut off a short tag
		(using another restriction enzyme that cuts
		away from its recognition site, which is located
		in the docking module)
5. 	Combine two tags into a unit, a di-tag (using ligase)
6. 	Make billions of copies of the di-tags (using PCR)
7. 	Remove the modules and glue the di-tags together into long concatamers
		(using restriction enzymes and ligase to join blunt ends)
8. 	Put the concatamers into bacteria and copy them millions of times
		(using standard cloning vector)
9. 	Pick the best concatamers (length = 25-75 tags) and sequence them
		using high-throughput sequencer
10. 	Use software to identify how many different cDNAs there are, and count them;
11. 	Match the sequence of each tag to the gene that produced the RNA.
		(alignment against existing genomic data)

[based on http://www.embl-heidelberg.de/info/sage/,
 for a more detailed description, see http://www.genetics.pitt.edu/sage/#over]

Note: Step 5 (ditag formation) is important in order to recognise
	and prevent asymmetric PCR amplification of tags in Step 6.
	Ditags help to avoid this problem - since the probability
	for forming each ditag is very small, 
	presence of repeated ditags indicates invalid PCR result.

sources of errors:
* Tags may not always uniquely identify transcripts
* One transcript may have more than one tag (if multiple poly-A sites are present)
* Some transcripts may not contain restriction site for anchoring enzyme
	-> won't be detected by SAGE
* restriction enzymes are not 100% precise -> varying length
	of tags, misinterpretation of ditags (e.g., 14+14=28=12+16)

major advantage of sage compared to microarrays:
- can identify unknown targets (gene finding)
- can be quantitatively very precise

--
briefly mention: oligonucleotide fingerprinting (ONF) - Lennon & Lehrach, 1991
	- see Shamir & Sharan, 2002
  originally motivated by "sequencing by hybridisation"
  spot cDNA on membrane (up to 31,000)
  sequentially hybridise with p oligos (7-12 bases long)
  -> fingerprint vector with p 0's/1's for each cDNA spot
  (barcode analogy)
  estimate expression level from #spots with similar / identical fingerprints
  (identify cDNAs for same gene by clustering)

briefly mention: new techniques for detecting proteins directly
	(protein arrays; based on various capturing techniques,
	such as antigen-antibody, specific protein/protein, 
	adaptamer/protein, receptor/ligand, enzyme/sustrate interactions)
	(not covered here)

----
6.2 clustering gene expression data

goal: identify groups of genes with sim expression patterns

-> clustering of gene expression data:

given: 

elements = genes
characteristic vector for each gene = expression level under various conditions
	(each element of this vector stems from one
	expression level measurement)

measure of similarity between pairs of vectors (e.g., corr coeff of vectors)
	(can also use measure of distance)

want: partition elements (genes) into subsets (clusters) such that:
	- elements in same cluster are highly similar (homogeneity)
	- elements from differ clusters have low similarity (separation)

-> optimisation problem

[hom = avg sim between elements and their clusters -> min]
[sep = weigthed avg sim betw clusters -> max]

note:
- can use various measures of similarity; homogeneity & separation 
- objectives (hom, sep) often conflict


clustering algorithms: (many developed for broad range of applications)

- in many cases, "clean" clustering alg with proven properties
	were developed first; then more efficient heuristic alg derived
	from these.

--
hierarchical clustering:

- bottom up (agglomerative clustering, see UPGMA) 
	or top down (repeated partitioning of sets of elements)

agglomerative hierarchical clustering:

- input dissimilarity matrix D=(d_ij)
- start with partitioning into singleton sets
- in each step, join two clusters with minimal dissimilarity (max sim)
- update dissimilarity matrix

- output: dendrogram (tree of hierch cluster struct)
	ordered fingerprint matrix

[slide: algorithm outline for agglom hier clustering]

note: several variants, differ in def of dissim between clusters (upd matrix)
	(diff parameters alpha_*, gamma in outline)s

-> shamir & sharan, 2002, Section 11.4.1

-> Cluster / TreeView software (Eisen et al., 1998)
  (similarity measure: form of correlation coefficient)


--
k-means (first proposed in 1965!):

- assume number of cluster is known
- given: fingerprint vectors
- goal: minimise distance between elements and centroids of assigned cluster


for cluster j, c(j) = centroid  of cluster j = mean of fingerprint vectors in j

for element i, P(i) = cluster assigned to i by partition P

d(v1,v2) = Euclidian distance between (fingerprint) vectors v1,v2

K-means tries to find partition P such that 
E_p = sum_i=1..n d(i,c(P(i)) is minimised

idea: local search over partitions:
	each iteration modifies current P by moving one element between clusters
	to reduce E_p

[slide: alg outline]

(can also move in one iter all elem that would benefit from move)

-> easy to implement, used in many applications

variant of Herwig et al. (1999) works for unspecified number of clusters


--
CAST - Clustering Affinity Search Technique (Ben-Dor et al., 1999)

- motivated by polynomial time alg for finding true clustering with high prob 
	under prob model of data (Ben-Dor et al., 1999)

input: similarity matrix S

affinity of elem v to (putative) cluster C:
a(v) = sum{S(i,v) | i \in C } 

idea: generate clusters one by one;
	start cluster with single element, 
	repeat
		add element with max affinity a if a > t|C|
		remove element with min affinity a if a <= t|C|
	until no change

additional iterative improvement local search at end of alg

[slide: alg outline of CAST], t is a parameter

- implemented in package BIOCLUST


--
other methods:

- HCS, CLICK:
  recent clustering methods based on graph theoretic approach 

- self-organising maps (Kohonen networks)
  "unsupervised learning alg"
  assumes known number of clusters
  iterative method
  implemented in GeneCluster software (Tamayo et al., 1999)

- large margin classifiers
  also: support vector machines, boosting

--
notes on clustering (from conclusions of Shamir and Sharan
- clustering is (to some extent) an art
- no universal, agreed-upon criteria for eval of solutions
- no ultimate alg (problem too general; but true even for gene expression)
- further improvement of alg needs more data (real + synthetic)


---
Resources:

* Roded Sharan, Ron Shamir: Algorithmic Approaches to Clustering Gene Expression Data
In: T. Jiang, T. Smith, Y. Xu, M.Q. Zhang (eds.) 
Current Topics in Computational Biology, pages 269–300. 
MIT Press, Cambridge, Massachusetts, 2002.


Further Reading:


---